Fix liveness and readiness probes #396
@timuthy Thanks for the quick fix! Overall LGTM except for a small nit.
/lgtm
Thanks for the PR @timuthy!
One comment from me
Thanks for the reviews @shreyas-s-rao and @aaronfern 🚀 I fixed the type, PTAL.
@timuthy the unit tests for sts component are failing because the probes in
/hold
Signed-off-by: Shreyas Rao <shreyas.sriganesh.rao@sap.com>
/lgtm
The enablement of startup/liveness probes through #396 showed that they cause more harm than good:
- The startup time of etcds can vary depending on the state and amount of data.
- If startup does not happen in the expected time, the failing probes kill the container, which does not help to solve the issue at all but ends in an endless loop of restarts.
- Liveness probes had been disabled for a long time before, which never caused issues in our experience.
- Other communities have come to a similar conclusion, see https://github.com/improbable-eng/etcd-cluster-operator/blob/master/docs/operations.md#why-arent-there-liveness-probes-for-the-etcd-pods
Co-authored-by: Tim Usner <tim.usner@sap.com>
How to categorize this PR?
/area quality
/kind bug
What this PR does / why we need it:
This PR fixes several issues with the currently used liveness and readiness checks and also adds a startup probe for single- and multi-node etcds.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Issue with current liveness probe:
When the probe command is run via /bin/sh -ec, only the first argument is considered, i.e. /bin/sh -ec ETCDCTL_API=3 is always successful.
Issue with current readiness probe:
The exec command did not evaluate the status code of the HTTP response and thus the container was considered ready even though the /health(z) endpoint returned != 200.
For single-node it's possible to switch to an http probe to solve the explained issue.
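The liveness-probe pitfall described above can be reproduced in any POSIX shell. The following is a minimal sketch, not the actual probe from this PR: the stand-in command `false` replaces the real etcdctl invocation, which is not shown here. With `-c`, `sh` executes only the first operand as the command string and binds the remaining operands to `$0`, `$1`, and so on, so a lone variable assignment always exits with status 0.

```shell
#!/bin/sh
# With -c, sh treats only the FIRST operand as the command string;
# later operands become $0, $1, ... and are never executed.

# Broken form: the assignment ETCDCTL_API=3 is the whole script, so the
# exit status is 0 although the stand-in command `false` would fail.
broken_status=0
/bin/sh -ec ETCDCTL_API=3 false || broken_status=$?

# Fixed form: the full command line is a single string, so `false` runs
# (with ETCDCTL_API=3 in its environment) and its failure is reported.
fixed_status=0
/bin/sh -ec 'ETCDCTL_API=3 false' || fixed_status=$?

echo "broken=$broken_status fixed=$fixed_status"
```

Quoting the whole command as one string is one way to make the probe's exit status reflect the real check; whether the PR uses exactly this form is an assumption.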
For multi-node it's necessary to use an exec probe because the /health endpoint of etcd is protected by mutual TLS (also see etcd-io/etcd#12370) and providing a client cert is not supported via Kubernetes http probes.
The new liveness probe is now accurate, but still fails after a few seconds because the backup sidecar requires a long time to promote its etcd member from learner. This leads to the etcd container being restarted; the backup sidecar's initialization then fails and starts over when the etcd container comes back up. This cycle continues, and the etcd pods never become ready. This problem is solved by using a startup probe of 2 minutes, which allows the initialization to complete without interruptions from etcd container restarts.
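For the multi-node case, an exec probe has to do what the Kubernetes http probe cannot: present a client certificate and turn the HTTP status into an exit code. The sketch below illustrates that idea only. The function names `probe_exit_code` and `check_health` and the variables `HEALTH_URL`, `CA_FILE`, `CERT_FILE` and `KEY_FILE` are hypothetical placeholders and are not taken from this PR.

```shell
#!/bin/sh
# Map an HTTP status code to a probe exit code: only 200 counts as ready.
probe_exit_code() {
  [ "$1" = "200" ]
}

# Hypothetical exec-probe body for an etcd member behind mutual TLS.
# The URL and certificate paths are placeholders supplied by the caller.
check_health() {
  code=$(curl --silent --output /dev/null --write-out '%{http_code}' \
    --cacert "$CA_FILE" --cert "$CERT_FILE" --key "$KEY_FILE" \
    "$HEALTH_URL") || return 1   # connection or TLS failure: not ready
  probe_exit_code "$code"
}

# Offline demonstration of the status handling (no etcd required).
for code in 200 503; do
  if probe_exit_code "$code"; then
    echo "$code -> ready"
  else
    echo "$code -> not ready"
  fi
done
```

Capturing the status with `--write-out '%{http_code}'` and comparing it explicitly avoids the earlier problem where a non-200 response still produced a successful exit status.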
Release note: